Prosper Marketplace,Inc., which was founded in 2005, is America’s first peer-to-peer lending marketplace, Borrowers request personal loans on Prosper and investors (individual or institutional) can fund anywhere from $2,000 to $35,000 per loan request.
This report explores the Prosper’s loan data, which contains 113,937 loans with 81 variables on each loan, and it was Last updated on 03/11/2014. The dataset will be used for exploring two main ideas 1) Investors’ preferences 2)The relationship between borrowers’characteristics and the risks.
## [1] 113937 36
## 'data.frame': 113937 obs. of 36 variables:
## $ ListingCreationDate : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
## $ CreditGrade : Factor w/ 9 levels "","A","AA","B",..: 5 1 8 1 1 1 1 1 1 1 ...
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
## $ BorrowerRate : num 0.158 0.092 0.275 0.0974 0.2085 ...
## $ ProsperRating..Alpha. : Factor w/ 8 levels "","A","AA","B",..: 1 2 1 2 6 4 7 5 3 3 ...
## $ ListingCategory..numeric. : int 0 2 0 16 2 1 1 2 7 7 ...
## $ BorrowerState : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
## $ Occupation : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
## $ EmploymentStatus : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
## $ EmploymentStatusDuration : int 2 44 NA 113 44 82 172 103 269 269 ...
## $ IsBorrowerHomeowner : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ FirstRecordedCreditLine : Factor w/ 11586 levels "","1947-08-24 00:00:00",..: 8639 6617 8927 2247 9498 497 8265 7685 5543 5543 ...
## $ CurrentCreditLines : int 5 14 NA 5 19 21 10 6 17 17 ...
## $ OpenCreditLines : int 4 14 NA 5 19 17 7 6 16 16 ...
## $ TotalCreditLinespast7years : int 12 29 3 29 49 49 20 10 32 32 ...
## $ OpenRevolvingAccounts : int 1 13 0 7 6 13 6 5 12 12 ...
## $ InquiriesLast6Months : int 3 3 0 0 1 0 0 3 1 1 ...
## $ CurrentDelinquencies : int 2 0 1 4 0 0 0 0 0 0 ...
## $ AmountDelinquent : num 472 0 NA 10056 0 ...
## $ DelinquenciesLast7Years : int 4 0 0 14 0 0 0 0 0 0 ...
## $ PublicRecordsLast10Years : int 0 1 0 0 0 0 0 1 0 0 ...
## $ BankcardUtilization : num 0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
## $ AvailableBankcardCredit : num 1500 10266 NA 30754 695 ...
## $ TotalTrades : num 11 29 NA 26 39 47 16 10 29 29 ...
## $ TradesNeverDelinquent..percentage.: num 0.81 1 NA 0.76 0.95 1 0.68 0.8 1 1 ...
## $ TradesOpenedLast6Months : num 0 2 NA 0 2 0 0 0 1 1 ...
## $ DebtToIncomeRatio : num 0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
## $ IncomeRange : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
## $ IncomeVerifiable : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ MonthlyLoanPayment : num 330 319 123 321 564 ...
## $ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
The investors can use the invest tool to browse loans by multiple criteria and fund the selected loan requests, and there might be several investors per loan request, so let’s begin with the ‘Investors’ variable.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 2.00 44.00 80.48 115.00 1189.00
Most loans (75%) received the funding from less than 115 investors. Now, let’s dig deeper into the characteristics of loans and borrowers.
The category of the listing is what the borrower selected as the purpose of the loan when posting their listing. In order to make the plot readable, I converted the numeric values to the corresponding factors(labels). Moreover, I also take the square root of the count.
##
## Not Available Debt Consolidation Home Improvement
## 0.1488980753 0.5117564970 0.0652378069
## Business Personal Loan Student Use
## 0.0630962725 0.0210203885 0.0066352458
## Auto Other Baby&Adoption
## 0.0225738785 0.0921035309 0.0017465792
## Boat Cosmetic Procedure Engagement Ring
## 0.0007460263 0.0007986870 0.0019045613
## Green Loans Household Expenses Large Purchases
## 0.0005178300 0.0175184532 0.0076884594
## Medical/Dental Motorcycle RV
## 0.0133582594 0.0026681412 0.0004563926
## Taxes Vacation Wedding Loans
## 0.0077674504 0.0067405672 0.0067668975
Debt Consolidation is the commonest purpose of the loan requests, more than 50% of the borrowers use the funding for debt consolidation.
How about the loan amounts? Let’s taka a look at the origination amount of the loans.
Most of the loan original amount values are between $1,000 and $15,000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6500 8337 12000 35000
The mininmum is $1,000, the maximum is $35,000, and the median as $6.500
Prosper Ratings, from lowest-risk to highest-risk, are labeled AA, A, B, C, D, E, and HR (“High Risk”). Applicable for loans originated after July 2009.
## A AA B C D E HR
## 29084 14551 5372 15581 18345 14274 9795 6935
Prosper Ratings of A,B and C are three major ratings among the loans.
The Borrower’s interest rate for this loan.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1340 0.1840 0.1928 0.2500 0.4975
The current status of the loan: Cancelled, Chargedoff, Completed, Current, Defaulted, FinalPaymentInProgress, PastDue.
##
## Cancelled Chargedoff Completed
## 5 11992 38074
## Current Defaulted FinalPaymentInProgress
## 56576 5018 205
## Past Due (>120 days) Past Due (1-15 days) Past Due (16-30 days)
## 16 806 265
## Past Due (31-60 days) Past Due (61-90 days) Past Due (91-120 days)
## 363 313 304
The PastDue Status was subdivided into six factors by the delinquency periods, and I would like to add a new row “LoanStatusNew” which all the PastDue Status will be grouped by only one factor so that I can use it to find out the correlations between risk and other variables.
##
## Cancelled Chargedoff Completed
## 5 11992 38074
## Current Defaulted FinalPaymentInProgress
## 56576 5018 205
## Past Due
## 2067
##
## Cancelled Chargedoff Completed
## 0.0000438839 0.1052511476 0.3341671274
## Current Defaulted FinalPaymentInProgress
## 0.4965551138 0.0440418828 0.0017992399
## Past Due
## 0.0181416046
More than 10% of the loans are charge-off, 4% are defaulted, and nearly 2% of the loans are past due.
Next, I would like to know more about the borrowers.
The state of the address of the borrower at the time the Listing was created, and a heatmap might be a perfect fit for this geographical variable.
With the heatmap above, we can see that California has the largest number of borrowers.
To have a roughly understanding about the length of borrowers’ credit history, I would like to create a new variable, creditlength, by using the difference between two available variables-FirstRecordedCreditLine and ListingCreationDate. As mentioned before, I just want a roughly understanding about the length and see if there is any value for further exlporing, so I only use the year in the variables to calculate. By doing so, we can know how many years the borrowers’lengths of credit history are.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 11.00 16.00 16.68 21.00 63.00 697
The length of credit history shows a normal distribution , and more than 50% of the borrowers have more than 16-year-long credit history.
Credit score represents the creditworthiness of a borrower, and it is also used by banks or other lenders to evaluate the potential risk of a loan. Here, I am going to use CreditScoreRangeLower, the lower value representing the range of the borrower’s credit score as provided by a consumer credit rating agency, to see the borrowers’ credit scores.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 660.0 680.0 685.6 720.0 880.0 591
The credit scores are nearly normally distributed, with means of 685.6, which is much better than I expected.
Now, I would like to look into the credit lines closer. Let’s begin with the utilization of the revolving credits. The variable shows the percentage of available revolving credit that is utilized at the time the credit profile was pulled.
To remove the 1% outliers, I use the quantile function and create the plot with 99% of the data.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.310 0.600 0.561 0.840 5.950 7604
More than a half of the borrowers use 60% of the revolving credits.
Unlike open credit lines, closed-end credit provides a fixed amount of money to finance a specific purpose and period of time. The loan may require periodic principal and interest payments,or payment of the entire principal at the end of the loan term. Many financial institutions refer to closed-end credit as an installment loan or a secured loan. Generally, real estate and auto loans are closed-end credit, and credit cards are revolving lines of credit or open-end. I think closed-end credit line might be a good variable to predict the risk later, so I will create a new variable ‘Closedendcredit’ by two on hand variables, ‘CurrentCreditLines’ and ‘OpenCreditLines’. The diffrence between the variables will be the quantities of the closed-end credit line.
Not all the borrowers have a closed-end credit line, and most closed-end credit lines owners have 1 or 2, but still, we can examine if the closed-end credit lines oweners would cause less charge-off or delinquencies later.
In the next plot, I will explore the percentages of each borrower ’s trades that have never been delinquent at the time the credit profile was pulled.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.820 0.940 0.886 1.000 1.000 7544
It seems that most of the borrowers are successful in repayment.
As for the borrowers’ income, I created the plot below with the variable ‘StatedMonthlyIncome’, which is the monthly income the borrower stated at the time the listing was created.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3200 4667 5608 6825 1750000
The minimum is $0, the median is $4,667, and the maximum is $1,750,000. There are some significant outliers, so I Limited the values to 99% quantile to avoid the outlier issue.
From the plot and summary above, we learn that most of the borrowers’ annuel income is less than $100,000. However, knowing the income is not enough, the debt to income ration will give us a better understanding about the borrowers’ financial status.
The debt to income ratio of the borrower at the time the credit profile was pulled. This value is Null if the debt to income ratio is not available. This value is capped at 10.01 (any debt to income ratio larger than 1000% will be returned as 1001%).
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.140 0.220 0.276 0.320 10.010 8554
The minimum is 0, the median is 0.22, and the maximum is 10.010. Again, there are some significant outliers, so I Limited the values to 99% quantile to avoid the outlier issue.
The debt to income ratio histogram shows a positively skewed, normal distribution, and more than 75% of the borrowers have the debt to income ratio less than 0.32.
There are 113,937 observations in the Prosper loan data with 81 variables. However, I have narrowed the variables down to 35 by removing the similar information. In the univariate analysis section, I employed 19 variables, including 10 continuous Variables and 9 categorical variables.
Continuous Variables: Investors,LoanOriginalAmount,BorrowerRate,BankcardUtilization, TradesNeverDelinquent,StatedMonthlyIncome,DebtToIncomeRatio,PublicRecordsLast10Years, CurrentCreditLines and OpenCreditLines.
Categorical Variables:ListingCategory,ProsperRatings,Term,LoanStatus,BorrowerState, IsBorrowerHomeowner, CreditScoreRangeLower,FirstRecordedCreditLine and ListingCreationDate.
The main features in the dataset is loan status and investors. I position myself as an analyst in Prosper Marketplace and there are two main investigations in this report. First, I’d like to determine which features are best for predicting the risk of charge-off or default, so loan status is the dependent variable of this analysis. Second, I would like know about investors’ preferences, so I will use investors to figure out which variables are the criteria that the investor would value more while choosing the loan requests.
ListingCategory,CreditScoreRangeLower, ProsperRatings, LoanOriginalAmount,Term, and BorrowerRate likely contribute to the average numbers of the investors. As for loan status, I think CreditScoreRangeLower, DebttoIncomeRatio, BorrowerRate, LoanOriginalAmount, StatedMonthlyIncome, TradesNeverDelinquent and Bankcard Utilization probably contribute to the the loss or risk.
I created two new variables, ‘Length of Credit History’ and ‘Closed-end credit line’.
I use the exsting variables, FirstRecordedCreditLine and ListingCreationDate to calculate the borrowes’ lengnth of credit history. Moreover, I also use ‘CurrentCreditLines’ and ‘OpenCreditLines’ to create the new variable ‘Closed-end credit line’.Both of the new variables might be helpful for the risk analysis later.
PublicRecordsLast10Years was a continuous numeric variable, but I coverted it to boolean value, so I can see how many borrowers have public records clearly.
LoanStatus had 12 factors, and 6 of the factors are Pastdue status (different delinquency period ), but I want the data to be simplier, so I combine the 6 pastdue factors as one.
ListingCategory was a set of numbers, which represent different categories: 0 - Not Available, 1 - Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7- Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans. In order to make the plot readable, I converted the numeric factors to string factors.
Instead of histogram, an US map of heatmap can be a more graphic visualization for BorrowerState. To create the heatmap, I needed to seperate the BorrowerState to two row: State.name and numbers in the fisrt place. Later, I have transferred the states’ abbreviation to all lowercase full names and merge both of the rows to the set called “map” in library(map).
In the section, I will focus on two main features of interest in the report-Investors and Loan Status. Unlike the univariate plots above, , I will also start using multiple colors in the bivariate plots for better visualization.
First, what purpose of loans are more popular among investors? There are 20 categories that borrowers can select and we already know that ‘Debt Consolidation’ accounts for 50% based on the univariate plot above. However, to answer the question, the numbers I need here are the average investors for each categories. Therefore, I used the ‘group by’ function to calculate the mean investors of all the categories.
‘Business’ accounts for only 6% of the all listings, but it is the the most popular category among investors. Every ‘Business’ loan can attract more than 120 investors.
Did the investment pay off? Let’s see if the numbers of average investor very in different loan status.
From the plot and the table above, we can learn that the average investors of chargedoff and defaulted loan are more than the loans lebeled with’FinalPaymentInProgress’ and ‘current’ but less tan the completed loans.
Dose the credit scores matter? I use ‘CreditScoreRangeLower’ here as what I did in univariate section.
The black dotted line is the median of the borrowers’ credit score, 680, and it seems that most investors would choose the loan requsts more than the median scores.
Do the investors trust Prosper Rating system? Prosper Rating composed by 7 levels,from lowest-risk to highest-risk, are labeled AA, A, B, C, D, E, and HR (“High Risk”). According to the statistics in the univariate section, we learn that most loan were rated ‘C’. To answer the question, I will use the average investors for each rating. Therefore, I used the ‘group by’ function to calculate the mean investors of all the ratings.
Even though ‘C’ rating is the majority, ‘AA’ is the most popular rating among the investors; ‘E’ and ‘HR’ are undesirable. It seems that most investors think Prosper Rating System is trustworthy and they are conservative with investments.
Do the investors prefer the loan with larger amount or less?
##
## Pearson's product-moment correlation
##
## data: df$LoanOriginalAmount and df$Investors
## t = 138.7077, df = 113935, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3751140 0.3850494
## sample estimates:
## cor
## 0.3800926
Under the amount of $25,000, we can see that there’s a positive linear correlation between investors abd the loan amount, and the correlation coefficient is 0.38.
Does the interest rate motivate investors?
No! The plot above tell us again that the investors on Prosper are conservative. In fact,higher interest rate, less investors.
##
## Pearson's product-moment correlation
##
## data: df$BorrowerRate and df$Investors
## t = -96.2493, df = 113935, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2795783 -0.2688385
## sample estimates:
## cor
## -0.2742169
Let’s calculate the correlation coefficient to look into it closer, the correlation coefficient is -0.274; there’s slightly negative correlation between investors and the interest rate.
For the ‘Investors’ variable, there are relatively strong correlation with ‘ListingCategory’, ‘ProsperRating’ ‘LoanOriginalAmount’ and ‘Term’. Based on the analysis, we can learn that investors prefer the 3-year business loan with AA ratings, and if the amount of the loan is under $25,000, the larger amoumt will attract more investors.
As for the other main feature ‘LoanStatus’ variable, there are relatively strong correlation with ‘DebtToIncomeRatio’, ‘BorrowerRate’, ‘StatedMonthlyIncome’ and ‘TradesNeverDelinquent’. Based on the analysis, we can learn that the borrowers with higher debt-income ratio, higher interest rate, less monthly income and less trades that never delinquent are more likely to fail in repayment.
Yes, the relationship between occupation and the credit score fits the common stereotype well. The decent job titles implies the higher credit scores.
The relationship between Investors and ProsperRating is strong.
## Ratings Investors_per_llisting_by_rating
## 1 116.09971
## 2 A 97.52869
## 3 AA 178.36783
## 4 B 69.80245
## 5 C 51.09054
## 6 D 56.23371
## 7 E 35.01419
## 8 HR 35.27585
The gap between the average investors of ‘AA’ and ‘HR’ is almost 143, which means that each AA rating loan can attract 143 more investors than the ‘HR’ loan. The relationship is not only strong but also significant.
‘Investors’ is one of the most important features of the report, so I would like to see the correlation coefficient between ‘Investors’ and other numerical continuous variables of the data.
From the plot, we can see the pearson’s r between the variables and ‘Investors.’ Next,I will choose the variables which have comparatively stronger correlation with ‘Investors’ and also apply the other main feature in the report ‘Loan Status’ and some other categorical variables to create some multivariate plots.
The interesting finding in the plot above is that starting from the scores 650, even the borrowers have the same scores, the investors are more likely to fund the homeowner’s loans.
As the analysis in 4-18, I will use 4 occupations-Executive, Computer Programmer, Administrative Assistant and Clerical here and I would like to know if the investors have a bias against the borrowers who are not from management level or Computer Programmer.
## Var1 Freq
## 37 Other 28617
## 43 Professional 13628
## 14 Computer Programmer 4478
## 21 Executive 4311
## 61 Teacher 3759
## 3 Administrative Assistant 3688
The finding here is interesting but cruel: Computer Programmer is the most popular occupation from the scores 600 to 700, and after 700, Executive become the most popular occpation gradually .
Even the borrowers have the same credit scores, the investors prefer the borrowers with decent job titles.
Last, let’s add more demographic variables to the analysis. I would like to use the other main feature of the report ’LoanStatus’to see the borrowers from which State are more likely to successfully repay to the loans.
First, I’d like to narrow the 50 States down to top 5 States by the amount of the borrowers based on the analysis in the univariate section. The top 5 States include California, Texas, Florida, New York and Illinois.
In order to calculate the risky loan rate , I would also simplify the levels of the loan status, so I will combine the status ‘Chargedoff’ and ‘Defualted’ as a new level ‘Loss’ and the rename the ‘Past Due’ level ‘Risk’. Here, I will use the sum of ‘Loss’ and ‘Risk’ to calculate the risky loan rate.
To make the analysis more accurate, I also use the ‘IncomeRange’ as another variable here. Thus, we will learn that if the borrowers in the same income range will have different repayment behaviour by different residential State.
## Var1 Freq
## 6 CA 14717
## 45 TX 6842
## 36 NY 6729
## 11 FL 6720
## 16 IL 5921
## 1 5515
The informative plot above shows us that in the lowest income range, the borrowers from New York are more likely to have better repayment behaviour; in the highest income range, the borrowers from Taxes are more likely to have better repayment behaviour.
‘Investors’ VS. ‘Occupation’ VS.‘CreditScoreRangeLower’ is extended from the 4-18 plot in the bivariate section. I never thought that the ‘Occupation’ is so important to the investors, and the magnitude of the relationship between investors and occupation become more significant while we control the ‘credit score’ variable.
In the 6-4 ‘LoanStatus’ VS. ‘BorrowerState’ VS.‘IncomeRange’, I found that in general, the borrowers from Taxes and New York have better repayment behaviors, and the borrowers from Illinois have the worst one. It’s quite interesting that the geographic variable might also be a predictor variable in the analysis.
Laon listings under the category ‘Business’ have the most mean investors; and loan listings under the category ‘Household Expenses’ have the least mean investors. Investors incline to trust athe loan specifically intended for business purposes instead of a personal fixed expense.
The income range is positively and strongly correlated with risky loan rate, the higher income borrowers have less risky loan rate. The variables Borrower’s State also correlate with the risky loan rate, we can see the borrowers in some State have relatively higher risky loan rate in various income range. Thus, income range and Borrower’s State can be used in a model to predict the risk of a loan.
The plot indicates that a linear model could be constructed to predict the average investors of variables using 99.5% investors as the outcome variable and credit scores as the predictor variable. Holding credit scores constant, the borrowers with more respectable job titils, such as computer programmer can attract more investors than borrowers with less decent job titles to account for additional variability in investors.
The Prosper loan dataset contains information on 113,937 observations across 81 variables. I started by understanding the individual variables in the data set, and then I picked up 36 variables to explored interesting questions and leads as I continued to make observations on plots. Eventually, I explored the investors and loan status across several variables.
While analysing the dataset, one thing made me anxious: the correlations between main features and other variables are not so significant, the highest Pearson’s R I’ve seen is around 0.4. I expected the correlations between variables can be as strong as the tutorial materials, and the plots can easily be interpreted by themselves, so the fact After consulting with my data scientist friends, they told me that for the data in real world, Pearson’s R around 0.4 is relatively high, and it’s really difficult to see a 0.8 or 0.9 ones. Thus, I feel much more relieved about the dataset and have the confidence to exploring the data.
I found success while creating multivariate plots. As I mentioned above, I found the correlation between two variables are not strong and hard to be awared. However, when I added one more variable into the analysises, holding one of the variables constant, the correlation become much easier to be awared. For example, the correlation with ‘IsBorrowerHomeowner’ and ’Investors’are not notable in the bivariate section, however, when I added the credit score as the third variables, I can clearly see that homeowners somehow can attract more investors.
With the analysis above, I would like to know if the local ecoomy will affect the risky loan rate. In the multivariate section, I found that the borrowers in certain States are likely to have worse repayment behaviors, which might be caused by local economy. If I have some local economic indicators, such as as Consumer confidence index, I would like to apply the data to the analysis and examine the correlation between the local economy and the repayment behaviors there.